Aligning texts.

(57) A plurality of source text files are read, repre- 
senting similar information but in different 
natural languages. The files have correlated
layouts, in that the same layout commands are 
employed at similar points in the files. 

Similar text, from respective files, is aligned 
by identifying its position between equivalent 
word processing commands. 

Preferably, intermediate files are produced in
which the word processing (WP) commands are
converted into identifiable form. Aligned text,
which differs between the intermediate files
whereas WP commands will not differ, is
identified by a differential comparison operation,
such as a call to DIFF within a UNIX environment.
FIELD OF THE INVENTION

The present invention relates to a system for 
aligning source texts of different natural languages to 
produce, or add to, an aligned corpus. 

BACKGROUND OF THE INVENTION 

An aligned corpus consists of words, phrases and 
sentences in a first language, mapped onto substan- 
tially similar words, phrases or sentences in a second 
language. The aligned corpus is used in automated 
translation systems in which, given a word, phrase or 
sentence in a first language, the equivalent in the 
second language may be obtained. Similarly, given a 
word, phrase or sentence in the second language, its 
equivalent in the first language may be obtained. This 
principle may be extended, such that a multi-lingual
system may be provided, so that, given a word, 
phrase or sentence in any of the languages available, 
all the others may be translated simultaneously. 
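The mapping described above can be pictured as a small lookup structure. The following is only an illustrative sketch: the class name `AlignedCorpus`, its methods and the sample sentences are assumptions introduced here and do not appear in the specification. Each entry holds the same portion of text in several languages, and an index allows the entry to be found from any one of them.

```python
from collections import defaultdict


class AlignedCorpus:
    """Toy multi-lingual corpus: each entry is a dict {language: text}."""

    def __init__(self):
        self._entries = []                 # list of {language: text} dicts
        self._index = defaultdict(list)    # (language, text) -> entry positions

    def add(self, entry):
        """Store one aligned entry and index it under every language."""
        position = len(self._entries)
        self._entries.append(dict(entry))
        for language, text in entry.items():
            self._index[(language, text)].append(position)

    def lookup(self, language, text):
        """Return every aligned entry containing `text` in `language`."""
        return [self._entries[i] for i in self._index.get((language, text), [])]


corpus = AlignedCorpus()
corpus.add({
    "en": "Switch off the machine.",
    "fr": "Arrêtez la machine.",
    "de": "Schalten Sie die Maschine aus.",
})
matches = corpus.lookup("fr", "Arrêtez la machine.")
```

Given the French sentence, `matches[0]` carries the English and German equivalents together, which is the multi-lingual extension mentioned above.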

A system for translating text is shown in Figure 1 
and provides an environment for employing an 
aligned corpus. 

Operating instructions and data from the aligned 
corpus are supplied to a processing unit 15 from a 
hard magnetic disk drive 16. A floppy disk drive 17
receives floppy disks containing an input text, in a first
language, and also receives data relating to an output 
text in a second language, which is written to a sepa- 
rate file on the floppy disk. At the end of the process, 
the floppy disk holds the original file of the input text 
plus, in a separate file, the translated output text. 

In the 1950s and 60s it was a common belief that
an all-purpose translating system
would become available in the not too distant future.
It was then realised that such a system was much
further off and possibly would never be implemented,
given the problem of including sufficient background
information to facilitate intelligent translation.
However, it was also appreciated that
translation within a smaller specialised field
would be possible, given that many words which have
many different meanings would tend to have a much more
limited range of meanings within the confines of a
specialist field of activity.

However, a problem of creating a translation system
for operation within a specialist field of activity is
that of generating aligned corpora, given that a corpus
generated for one field of activity would probably
not be suitable for application in another field of
activity. Thus, it would be necessary for users working
in each field to generate their own corpora.
Consequently, this problem has tended to negate the use of
such automated systems and reliance continues to be
placed upon human translators.

The system shown in Figure 1 could be used,
rather than as a replacement for a translator, as an
assistant to a translator. Thus, each sentence, or part of a
sentence, could be displayed on an output device,
such as a visual display unit 18, while information
could be supplied to the processing unit 15 via an
input device, such as a keyboard 19.

The operation of such a system could be in the
form shown in Figure 2. As previously stated, an
aligned corpus 21 is resident on the hard magnetic
disk drive 16, or similar device, an input file is resident
on the floppy disk drive 17, or similar device, and the
output file is written, after being generated by the
processing unit 15, to the floppy disk drive 17. In an
alternative arrangement, two floppy disk drives could
be provided and the output file could be written to the
second drive. Alternatively, the output file could be
written to the hard disk drive unit 16 or to any other
suitable storage device.

Documents are processed on a page by page basis.
The flow chart shown in Figure 2 therefore
describes operation of the system with reference to a
single page. A page may be loaded which does not
actually contain any information and it is important that
the system does not become locked-out when it has
no information to process. At step 24 the question is
posed as to whether the end of the page has been
reached. If yes, the process stops at step 25.
Normally, the page will contain text, therefore the first
sentence of the input file is read at step 26. An enquiry
is now made to the aligned corpus 21, at step 27, to ask
whether the sentence under consideration is present within
the corpus. If the input sentence is present
in the corpus, the aligned output sentence is returned
from the corpus and at step 28 the translated form of
the sentence is written to the output file. In one
embodiment, the operator may be asked to check the
translation, by means of the translation being
supplied to the visual display unit 18, before the data is
actually written to the output file. However, in the
embodiment detailed in Figure 2, the translation is made
automatically, so as to improve processing speed.

If, in response to the enquiry made at step 27, the
input sentence is not present in the corpus, the
operator is prompted to provide an input, via the keyboard
19, of the correct translation, at step 29. At step 30,
the translation provided by the operator is written to
the destination file and an enquiry is made to the
operator, at step 31, as to whether the new
translation should be added to the corpus. If the
operator responds in the affirmative, the new alignment
is added to the corpus at step 32. If the operator's
response is negative, step 32 is ignored.

Thus, in response to each requirement to
translate a sentence, three responses become possible. In
the first, the translation is present in the corpus and
the translation is automatically written to the output
file. Alternatively, the sentence is not present in the
corpus, an input is provided by the operator and the
translation is then added to the corpus after being
written to the output file. Thirdly, the sentence is not
present in the corpus, again an input is provided by
the operator but this time the new translation is not
added to the corpus.

After writing a sentence to the output file,
operation returns to step 24, at which the enquiry is made
again as to whether the system has reached the end
of the page. If the response to this enquiry is
negative, another sentence is read at step 26 and
the procedure is repeated. At the end of the page, as
previously stated, the procedure stops at step 25.
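The loop of Figure 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of the system: the function name `translate_page`, the use of a plain dictionary as the corpus, and the two callables standing in for the keyboard 19 and the operator's yes/no answer are all hypothetical.

```python
def translate_page(sentences, corpus, ask_translation, ask_add):
    """Translate one page sentence by sentence (steps 24 to 32 of Figure 2).

    corpus          -- dict mapping source sentences to target sentences
    ask_translation -- callable standing in for the operator prompt (step 29)
    ask_add         -- callable returning True if the operator wants the new
                       pair added to the corpus (step 31)
    """
    output = []
    for sentence in sentences:                      # steps 24/26: until end of page
        if sentence in corpus:                      # step 27: present in corpus?
            output.append(corpus[sentence])         # step 28: write translation
        else:
            translation = ask_translation(sentence)  # step 29: operator input
            output.append(translation)               # step 30: write translation
            if ask_add(sentence, translation):       # step 31: add to corpus?
                corpus[sentence] = translation       # step 32: new alignment
    return output


corpus = {"Press the green button.": "Appuyez sur le bouton vert."}
page = ["Press the green button.", "Close the cover."]
result = translate_page(
    page,
    corpus,
    ask_translation=lambda sentence: "Fermez le couvercle.",
    ask_add=lambda sentence, translation: True,
)
```

An empty page simply yields an empty output, which corresponds to the requirement that the system must not lock out when a page holds no information.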

Thus, it can be seen that, on the assumption that
similar subject matter is being translated repeatedly,
the system will learn and entries within the corpus will
expand. The knowledge base of the corpus will
increase and, eventually, an operator providing manual
translations will no longer be required and an operator
of minimal skill may be allowed to take over. Possibly,
several systems may run in parallel and a manual
translator may be required occasionally to assist
non-skilled operators.

A problem with the system shown in Figure 2 is
that it may take significant resources to build up the
corpus to the point where the non-skilled operator
may take over. Initially, it is likely that use of the
system will actually take longer than a straightforward
manual translation. Furthermore, it is also highly
likely that systems, possibly operating within the same
office, will develop differently, with a corpus on one
being significantly different from a corpus on another,
such that operators would appear to be working at
different rates, again leading to further unpredictability.

Methods for automatic generation of aligned
corpora have been described for example by W A Gale
and K W Church in "A Program for Aligning Sentences
in Bilingual Corpora", and by P F Brown et al in
"Aligning Sentences in Parallel Corpora", both in the
Proceedings of the 29th Annual Meeting of the
Association for Computational Linguistics, Berkeley,
California. In these systems, the portions used correspond
to sentences, and alignment is performed by
comparing the lengths of sentences, either in the number
of words (Brown et al) or the number of characters
(Gale and Church).

Both of these references exploit the availability of
the Canadian Hansard in two languages, French and
English. Brown et al further exploit the presence of
descriptive mark-up codes in the Hansard texts, for
example codes indicating the times of speeches, the
names of the speakers and so on. These codes are
used to define anchor points in the text, and
preference is given to sentence alignments which preserve
the alignment of the anchor points. Of course,
descriptive markers are not available in documents in
general, and are not often in a common language,
even when they are present.

It is an object of the present invention to provide
an improved system for generating usable aligned
corpora. It is also an object of the present invention
to provide a plurality of copies of corpora which may
be used efficiently within a translating environment.

The inventors have recognised that, in many
cases, the similar documents which are to be used as the
source texts are available in a form which contains
presentational formatting data, for example
specifying the size or font to be used for output, indentations,
tabulations and other layout attributes. Provided that
the two source documents have similar
presentational attributes, formatting data included in the source
files can be used to assist in the alignment.

Accordingly, a first aspect of the invention pro- 
vides methods and systems for aligning source texts 
of different natural languages to produce or add to an 
aligned corpus, wherein source text files representing 
similar information in different natural languages are
read, and information aligning similar text portions 
from respective files is recorded, characterised in that 
said source text files have similar presentational attri- 
butes, and in that the alignment is performed with ref- 
erence to presentational formatting data present with- 
in said text files. 

The formatting data may be non-textual data, for 
example word processing commands. Where differ- 
ent word processors have been used and generate 
different, possibly non-textual formatting commands,
these may be converted to generic forms prior to per- 
forming the alignment. 

If the formatting data are converted to textual 
forms prior to performing the alignment, standard text 
file comparison means can be used to identify align- 
ments. 

As an alternative to aligning sentences, it may be 
advantageous for certain classes of documents to 
use the formatting data actually to delimit the aligned 
text portions. 

Thus the problem of generating an aligned cor- 
pus is effectively resolved by making use of texts in 
machine readable form. In particular, reliance is 
made upon correlated texts in different natural lan- 
guages. Two texts are considered to be correlated, as 
defined herein, when they convey the same informa- 
tion but in different natural languages. In addition, 
each page of the correlated texts may contain sub- 
stantially the same information, but in different lan- 
guages, laid out in a similar format. Thus, titles, ta- 
bles, character modifications, may all be present at 
substantially similar positions. 

The invention can be of particular use in the pro- 
duction of multi-lingual product documentation. Many 
products are sold with sophisticated documentation, 
explaining exactly how the product operates. Some- 
times, such documentation may run to many hundred 
pages and must be generated in many different
natural languages. Consequently, the cost of producing
such documentation becomes a significant part of the
total cost for the product itself. Furthermore, the time
incurred in generating such documentation may
result in a significant delay being introduced between
the date on which the product is available for market
and the date on which the technical manual is
available to accompany the product. This often results in
badly written and badly translated documentation, in
an attempt to get the product to market early.
Alternatively, further delay may result in potential sales
being lost to competitors.

Many organisations have produced a large
number of manuals, in which each translation is correlated
to the original text. Thus, for each translation, the
same WP system has been used as for the original
and the same formatting has been used. Thus, each
page of the manual in a first language looks, at first
sight, similar to the equivalent page in the equivalent
manual of a different language, in that headings,
paragraphs and drawings etc. all appear in more or
less the same place. However, the actual words
within the text are different, in accordance with the
particular natural language being used. It is therefore
apparent that a great deal of source material is often
available which, employing the present invention, may be
used to produce aligned corpora which are
immediately usable by unskilled operators. Furthermore,
such a procedure will produce corpora that are
consistent, thereby ensuring that all machines using
copies of the same corpus are equivalent.

In certain embodiments, each word processor
(WP) file is converted into an intermediate file, in
which data relating to specific WP commands, unique
to a particular WP system, are converted into a
general identifiable form. Thereafter, reference is made
to the identifiable WP commands, as a means of
aligning the text held between the layout commands
which have been placed into identifiable form.

In a preferred embodiment, different WP
commands for different WP systems are converted to
similar identifiable commands in the respective
intermediate files. It is then possible to identify alignable
text by comparing files to identify differences
between the files, wherein identifiable WP commands
are not different between the files. Text portions
identified as being different are written to the aligned
corpus.

The invention yet further provides methods and
apparatus for automatic translation, wherein
information of alignments between text portions has been
generated and stored by use of the invention as set
forth above.

BRIEF DESCRIPTION OF THE DRAWINGS 



Figure 1 shows a system for the automatic trans- 
lation of text; 

Figure 2 illustrates the operation of the system 
shown in Figure 1; 

Figure 3 shows an overview of the present
invention, including the creation of intermediate files
and the comparison of intermediate files;

Figure 4 details the operation of a first stage of
the preferred embodiment, concerning the
creation of intermediate files; and

Figure 5 details the operation of a second stage
of the preferred embodiment, concerning the
creation of an aligned corpus by the comparison of
intermediate files.

DETAILED DESCRIPTION OF A PREFERRED 
EMBODIMENT 

Operation of the system for generating an aligned 
corpus, in accordance with the present invention, 
may be performed using hardware substantially 
equivalent to that shown in Figure 1, in which proc- 
essing is performed on the processing unit 15, in re- 
sponse to instructions received from the hard mag- 
netic disk drive 16, or similar device, with output data 
being written to said disk drive 16 or to the floppy disk
drive 17, or similar device. 

Operation of the system for generating an aligned 
corpus is detailed in Figure 3. 

At step 310 it is necessary to generate or procure
correlated copies in different languages of the same
documentation. In some situations, this
documentation may not be available. Thus a decision must be
taken to the effect that all documentation in the fu- 
ture, where translations in several different languag- 
es are required, should be produced in correlated 
form, that is to say, the layout of all versions should 
be similar, so that the WP files contain substantially 
the same WP-specific commands, with only the text 
contained within these commands being actually dif- 
ferent due to the text being written in different natural 
languages. 

In many situations, text of this type may already
be available and rapid progress may be made, using
the invention, towards building extensive corpora. In
particular, texts may have been produced which
relate to subject matter similar to that for which a corpus
is being produced. Thus, machine manuals may
have been produced relating to particular types of
machines in which, although developments have
been made and modifications introduced, the
terminology would tend to be consistent. Therefore, not only
does this text provide for the rapid creation of a useful
corpus, it also ensures that terminology used for
subsequent models is consistent with the terminology
used previously.

In this example, it is assumed that a corpus is
being formed which aligns sentences, phrases and
words of two languages, although as previously stat- 
ed, sentences, phrases and words of more than two 
languages may be aligned. 

At step 320 a first source file is read using the
process detailed in Figure 4 to produce a first
intermediate file. An intermediate file is a file in which the
WP-specific commands have been translated into
characters which lie within the range of printable
characters in the character set, such as the ASCII
character set, and delimited by a character (or
sequence of characters) identifying them as such. A
table is provided to map WP-specific commands onto
identifiable character strings. Thus, when using
different WP systems, it is only necessary to amend entries
in this table and modifications to the rest of the
system are not required.
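A conversion table of the kind described above might be represented as a simple mapping. The sketch below is illustrative only: the byte values assigned to the two imaginary word processors, the table names and the function `load_table` are assumptions made for the example, since the specification does not give actual command codes.

```python
# Hypothetical command codes for two imaginary WP systems. Both tables map
# onto the same printable, angle-bracketed strings.
TABLE_WP_A = {
    b"\x01": b"<LT>",   # large text
    b"\x02": b"<UL>",   # underline text
    b"\x03": b"<NS>",   # normal size
    b"\x04": b"<PA>",   # paragraph
}
TABLE_WP_B = {
    b"\x1bL": b"<LT>",
    b"\x1bU": b"<UL>",
    b"\x1bN": b"<NS>",
    b"\x1bP": b"<PA>",
}


def load_table(wp_system):
    """Select the table for a WP system; the rest of the code is unchanged."""
    tables = {"wp_a": TABLE_WP_A, "wp_b": TABLE_WP_B}
    return tables[wp_system]


def is_printable_ascii(data):
    """True if every byte lies in the printable ASCII range 0x20 to 0x7E."""
    return all(0x20 <= byte <= 0x7E for byte in data)
```

Supporting a further WP system then amounts to supplying another table, which reflects the point made above that only the table entries need amending.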

At step 330 the process shown in Figure 4 is
repeated to produce a second intermediate file from the
second source file. Thus, after completing this step,
two intermediate files are available, derived from the
first language and the second language respectively.
At step 340 the system shown in Figure 5 is employed
to compare the intermediate files to produce an
aligned corpus. Thereafter, at step 350 the question
is posed as to whether sufficient data has been
supplied to the corpus and if this question is answered in
the negative, the procedure returns to step 310 and
reads another pair of correlated documents. Thus,
the number of iterations may be dependent upon the
number of input files available or, if many files are
similar, fewer than all of them may be processed. Again,
it is also possible that insufficient input files are
available to produce a corpus of any value and processing
may have to be put on hold until further correlated
copies become available.

Once the corpus has been generated and an
affirmative answer may be given to the question raised
at step 350, the corpus may be used in a translation
system of the type previously described with
reference to Figure 2, as stated by step 360.

Thus, the generation of an aligned corpus
essentially consists of two stages. The first stage produces
intermediate files, in which WP commands are
converted into an identifiable form, and the second
consists of comparing correlated intermediate files to
produce entries for the aligned corpus.

WP data files produced by word processing
systems contain printable characters, non-printable
characters and other non-character data. The file is
effectively a sequence of bytes, with each byte
representing a character or some other type of data. At steps 320
and 330 of the system shown in Figure 3, ASCII
characters defining text are retained in unmodified form.
Given that ASCII codes, or similar codes, form the
basis of many WP systems, the code used for each
textual character will tend to be the same for each WP
system. Thus, during the generation of intermediate
files, textual characters are not modified and these
characters provide the basis for defining alignments
which may be supplied to the aligned corpus.

In alternative embodiments, codes other than 
ASCII may be used, such as EBCDIC, BCDIC or a 16- 
bit character set such as UNICODE. 



Unlike the textual characters, the command char- 
acters tend to be used in a way which is specific to any 
one word processing system. The choice of which 
characters are used for a particular representation is 
purely arbitrary. The characters will be generated 
when the file is being created. Then, when the file is 
being printed, the characters will be interpreted by the 
WP system in order for suitable instructions to be
supplied to a printer. Usually, each WP system includes
a plurality of programs, usually referred to as printer 
drivers, which ensure that, in response to the control 
commands generated by the WP system, commands 
appropriate to the specific make of printer being used 
are sent to said printer so as to obtain the desired ef- 
fect. 

In the intermediate files, WP commands have 
been converted into a common identifiable form so as 
to delimit blocks of text which can be aligned with a 
similar block of text in the parallel correlated file. The 
following is a simplified version of a typical input file: 

(a) code - LARGE TEXT
    code - UNDERLINE TEXT
    text 1
    code - NORMAL SIZE
    code - PARAGRAPH
    text 2
    text 3
    text 4

The string of characters in this example first 
of all includes a code specifying that the following 
text is to be increased in size, say, for the purpose 
of providing a heading. A subsequent code states 
that the following text is also to be underlined.
Thereafter, the string includes a code instructing 
the interpreter to set character size back to nor- 
mal size, followed by another code specifying the 
start of the paragraph. 

An intermediate file is generated from the
above and consists of the following: 

(b) <LT>
    <UL>
    text 1
    <NS>
    <PA>
    text 2
    text 3

The unprintable codes are converted into print- 
able strings and placed within angled brackets, or any 
other identifying delimiters, so as to identify them as 
such. Thus, the code for large text becomes LT within 
angled brackets and, similarly, the code for underline 
text becomes UL within angled brackets. 

The text is left unmodified, as it is these portions 
of the intermediate files which will be supplied to the 
aligned corpus. The characters placed within angled 
brackets do not need to convey any information as 
such. The purpose of these characters is to provide 
alignment between the two intermediate files, in that 
a pair of intermediate files derived from correlated
input files will both include similar sets of WP
commands.

Thus, considering two intermediate files derived
from correlated texts, each intermediate file will be
initiated by the commands LT and UL within angled
brackets. This label is then used as a means of
aligning the subsequent text. That is to say, text 1 of a first
intermediate file will be aligned with text 1 of a second
intermediate file.
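The use of the bracketed commands as alignment labels can be illustrated directly. The following sketch assumes that each command occupies a line of its own and that both files contain the same sequence of commands; the function names and the sample English and French text are hypothetical. It pairs the text blocks by position rather than by file comparison.

```python
import re

# A line consisting solely of an identifiable command such as <LT> or <PA>.
TAG_LINE = re.compile(r"^<[A-Z]+>$")


def text_blocks(intermediate):
    """Split an intermediate file into the text blocks lying between commands."""
    blocks, current = [], []
    for line in intermediate.splitlines():
        line = line.strip()
        if TAG_LINE.match(line):
            if current:                      # a command closes the open block
                blocks.append("\n".join(current))
                current = []
        elif line:
            current.append(line)
    if current:                              # text after the last command
        blocks.append("\n".join(current))
    return blocks


def align_by_tags(file_one, file_two):
    """Pair the n-th text block of one file with the n-th block of the other."""
    return list(zip(text_blocks(file_one), text_blocks(file_two)))


english = "<LT>\n<UL>\nMaintenance\n<NS>\n<PA>\nRemove the cover.\nFit the filter.\n"
french = "<LT>\n<UL>\nEntretien\n<NS>\n<PA>\nRetirez le capot.\nPosez le filtre.\n"
pairs = align_by_tags(english, french)
```

Pairing by position fails as soon as one file contains a command the other lacks, which is one reason a differential comparison is the more tolerant approach.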

A system for generating intermediate files is
shown in Figure 4. Each source file 41 may include
many pages and the file is processed on a page by
page basis. The file 41 may be in any language,
therefore, when processing the two source files, the same
system may be used for each. The system of Figure
4 is concerned with the WP commands, wherein, as
previously stated, characters lying outside the
printable ASCII range and WP commands are converted
to character strings lying within said range, with the
addition of angled brackets to identify them as such.
Table 42 is dependent upon the type of WP system
being used and, when using a different WP system,
it is necessary to replace table 42. Table 42 would,
therefore, be stored as a separate file on disk 16, for
example, and during operation the specific table
required is selected by a call to the table file.

File 41 is the source input file and the system
shown in Figure 4 is not arranged to generate a
separate intermediate file. The intermediate file is
generated by modifying entries in the source file, such
that the intermediate file generated after completing
the procedure in Figure 4 occupies the same
memory locations as the initially read source file 41.

It is possible, although unlikely, that an input
source file 41 could be blank, therefore it is important
that the system shown in Figure 4 does not fail due
to an inability to identify data within the file. At step
43 the question is raised, therefore, as to whether
another page exists within file 41 and if this question is
answered in the negative, operation of the system
stops at step 44. If another page is waiting in file 41,
the question at step 43 is answered in the affirmative
and at step 44 the page is read.

Systems for exchanging one entry for another are
known as such and usually, exchanges of this type
are made by looking sequentially at an input string
and, as each new character arrives, a comparison is
made with entries in a look-up table to see whether
an exchange can be made. In the present application,
however, it was appreciated that such an approach
would cause problems, given that different tables 42
are required for different word processing systems. It
therefore becomes attractive to perform the operation
the other way round. Thus, the whole page is held
in memory and table values stored within table 42 are
read sequentially. Thus, the first value in table 42 is
read and the whole page is scanned to see whether
this value exists in the file. If the value does exist in
the file, entries are exchanged. That is to say, the
WP-specific value is replaced by the new value read
from table 42.

Thus, at step 45 the question is raised as to
whether another entry exists in the conversion table
42. Initially, this question must be answered in the
affirmative, therefore the first entry from table 42 is
read at step 46. At step 47 a question is raised as to
whether the entry read from table 42, a WP-specific
entry, has been found in the page read from file 41. If,
after scanning the whole page, no such entry is
found, the question raised at step 47 is answered in
the negative and the enquiry at step 45 is raised
again, as to whether another entry is present in the
conversion table. If an entry is found in the page, the
exchange is made at step 48 and at step 49 the
scanning process continues by the question being raised
as to whether the end of the page has been reached.
If no, scanning continues by returning to step 47,
enquiring as to whether the entry is present in the
document. Thus, a complete scan for the entry is made
and the scanning process is completed by an inability to
find an entry, detected at step 47, or by the end of the
page being reached, identified at step 49.

After the page has been scanned for an entry in
table 42, the question at step 45 is raised again, as
to whether another entry is present in the conversion
table. After all the entries in the conversion table have
been scanned through the page under consideration,
the question raised at step 45 is answered in the
negative, followed by the repeat of the question raised at
step 43, as to whether another page is present. If
another page is present, this is read from the file 41 and
the process is repeated. Eventually all of the pages
will be read from the file 41 and the question raised
at step 43 will be answered in the negative, resulting
in the process stopping at step 44.
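The table-driven scan of Figure 4 can be sketched as two nested loops, with the table on the outside and the page on the inside. This is an illustrative sketch only: the byte codes in `SAMPLE_TABLE` are invented for the example, and Python's `bytes.replace` is used here to stand in for the scan-and-exchange loop of steps 47 to 49.

```python
# Hypothetical WP command codes mapped onto identifiable printable strings.
SAMPLE_TABLE = {
    b"\x01": b"<LT>",
    b"\x02": b"<UL>",
    b"\x03": b"<NS>",
    b"\x04": b"<PA>",
}


def convert_page(page, table):
    """Exchange every WP-specific code found in `page` for its table value.

    The loop runs over table entries (steps 45 and 46); for each entry the
    whole page is scanned and all occurrences are exchanged (steps 47 to 49).
    """
    for wp_code, tag in table.items():
        page = page.replace(wp_code, tag)
    return page


def convert_file(pages, table):
    """Process a source file one page at a time (step 43 onwards)."""
    return [convert_page(page, table) for page in pages]


source_pages = [b"\x01\n\x02\ntext 1\n\x03\n\x04\ntext 2\ntext 3\n"]
intermediate_pages = convert_file(source_pages, SAMPLE_TABLE)
```

Because the loop is driven by the table rather than by the input stream, the same routine serves any WP system for which a table exists, and a blank file produces an empty result instead of a failure.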

The system for producing an aligned corpus,
identified at step 340 in Figure 3, is detailed in Figure 5.

The system described with reference to Figure 4 
has been used twice to create two intermediate files 
51, 52. The intermediate files are derived from corre- 
lated parallel files written in different natural languag- 
es, supplied to the system via floppy disks and floppy 
disk drive units 17. 

The system is initiated at step 53 whereafter, at
step 54, the two intermediate files 51, 52 are com- 
pared by the apparatus under control of a differential 
file comparator program, of the type commercially 
available. For example, a suitable file comparator pro- 
gram is DIFF, which is provided with and is callable 
from UNIX operating systems. 

DIFF reports differences between two files,
expressed as a minimal list of line edits (or
recipes) required to bring either file into agreement
with the other. The intermediate files 51, 52 provide
inputs to a DIFF call, which in turn produces a list of
recipes required to convert lines of file 51 to lines of
file 52. Thus, lines which do not require any
modification will be those containing the WP formatting
commands which are common between the two
intermediate files. Similarly, lines containing
corresponding pieces of text will require changes to be made
between the files. Thus, the DIFF program will identify
lines which do differ between the two files, which in
turn represent lines which may be written to the
aligned corpus 61.

Three types of recipes are produced by the DIFF 
program in its comparison of the two intermediate 
files, consisting of a "delete", an "append", and a 
"change". 

A "delete" recipe marks a piece of text or WP
formatting command in the intermediate file 51 as not
being present in the intermediate file 52. Such recipes
are ignored by the system, since they do not provide
any useful alignment data.

An "append" recipe marks a piece of text or WP
formatting command in intermediate file 52 as not
being present in intermediate file 51. Similarly, these
"append" recipes are ignored by the system since
they do not provide any useful alignment data.

A "change" recipe will mark a piece of text from
intermediate file 51 and a matching piece of text from
intermediate file 52. It is these "change" recipes
which provide useful alignment data.

The "change" recipe identifies a range of lines in
intermediate file 51 as being different from a similar
range of lines in intermediate file 52. This difference
exists because, although the information content is
the same, the text for files 51 and 52 is in different
languages.
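The selection of "change" recipes can be illustrated with a line-based comparison. The sketch below does not call the UNIX DIFF program; it substitutes Python's standard `difflib.SequenceMatcher`, whose "replace", "delete" and "insert" opcodes play the roles of the "change", "delete" and "append" recipes. The function name and the sample lines are assumptions made for the example.

```python
import difflib


def change_recipes(lines_one, lines_two):
    """Return (text_one, text_two) for each changed range between two files."""
    matcher = difflib.SequenceMatcher(None, lines_one, lines_two, autojunk=False)
    pairs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":      # counterpart of a DIFF "change" recipe
            pairs.append((
                "\n".join(lines_one[i1:i2]),
                "\n".join(lines_two[j1:j2]),
            ))
        # "delete", "insert" and "equal" ranges carry no alignment data
    return pairs


file_51 = ["<LT>", "<UL>", "Maintenance", "<NS>", "<PA>",
           "Remove the cover.", "Fit the filter."]
file_52 = ["<LT>", "<UL>", "Entretien", "<NS>", "<PA>",
           "Retirez le capot.", "Posez le filtre."]
recipes = change_recipes(file_51, file_52)
```

The command lines common to both files act as anchors, so each changed range holds text from one language opposite the corresponding text from the other.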

Thus, the alignment is possible because text
which is to be aligned, representing the same information
in different languages, is actually different and
these differences can be identified between the two
files. However, the portions of text which are identified as
being different, and which may therefore be aligned, are
located by the delimiters within the text file. Unlike the
text, these delimiters would be substantially equivalent
between the two files, given that equivalent formatting
commands were used. Thus, portions of the
text which are equivalent are used to separate portions
of the text which are identified as being different,
and these portions of the text which are identified as
being different then provide the basis for providing input
to the aligned corpus.
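
The way in which matching delimiters separate differing text can be sketched as follows. The sketch uses Python's standard difflib module in place of the DIFF program, and the delimiter tokens <HEADING>, <PARA> and <END> are hypothetical stand-ins for the identifiable forms of the WP commands; neither is taken from the specification.

```python
import difflib


def differing_portions(lines_1, lines_2):
    """Return (text_1, text_2) pairs for the runs of lines that differ."""
    matcher = difflib.SequenceMatcher(None, lines_1, lines_2, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "equal" runs are the shared delimiters; "replace" runs are the
        # differing text enclosed between them.
        if tag == "replace":
            pairs.append((" ".join(lines_1[i1:i2]), " ".join(lines_2[j1:j2])))
    return pairs


# Hypothetical intermediate files: same delimiters, different languages.
FILE_51 = ["<HEADING>", "Introduction", "<PARA>", "Switch on the printer.", "<END>"]
FILE_52 = ["<HEADING>", "Inleiding", "<PARA>", "Schakel de printer in.", "<END>"]

PAIRS = differing_portions(FILE_51, FILE_52)
```

Because the three delimiter lines are identical in both lists, they are reported as equal and each piece of text between them is reported as a separate difference, which is the property the alignment relies upon.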

The output of step 54 consists of a list of recipes
produced by the DIFF program for the intermediate
files 51, 52. Each recipe is read in turn at step 55 and,
if no more recipes are present, the procedure terminates
at step 63. If a recipe is available to be read, it is
read and checked at step 56 to see whether it is a
"change" recipe. If it is not a "change" recipe, the procedure
returns to step 55 and reads the next recipe.
If it is a "change" recipe, step 57 extracts the text of
language one from the recipe and step 58 extracts the
text of language two. From the texts of languages one
and two derived from steps 57 and 58, an aligned pair
of corresponding texts is formed at step 59.

At step 60, a comparison is made as to whether
this alignment already exists in the aligned corpus 61.
If the entry does already exist, resulting in an affirmative
answer to the question raised at step 60, the
alignment is ignored and the process is repeated for the
next recipe. If the question raised at step 60, as to
whether the alignment already exists in the corpus, is
answered in the negative, the alignment is written to
the corpus.
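
The loop of steps 55 to 60 may be sketched in Python as shown below. This is a minimal illustration under stated assumptions rather than the disclosed implementation: each recipe is assumed to be a (kind, text_1, text_2) tuple, the corpus is held as an in-memory list of pairs, and the name add_changes_to_corpus is hypothetical.

```python
def add_changes_to_corpus(recipes, corpus):
    """Append new aligned pairs from "change" recipes; return how many were added."""
    seen = set(corpus)                      # pairs already held in the corpus
    added = 0
    for kind, text_1, text_2 in recipes:    # step 55: read each recipe in turn
        if kind != "change":                # step 56: skip delete/append recipes
            continue
        pair = (text_1, text_2)             # steps 57 to 59: form the aligned pair
        if pair in seen:                    # step 60: alignment already in corpus?
            continue
        corpus.append(pair)                 # write the new alignment to the corpus
        seen.add(pair)
        added += 1
    return added                            # the loop ending corresponds to step 63


# Hypothetical recipes, including one repeated alignment.
RECIPES = [
    ("change", "Switch on the printer.", "Schakel de printer in."),
    ("delete", "Appendix", ""),
    ("change", "Switch on the printer.", "Schakel de printer in."),
    ("append", "", "Bijlage"),
]
CORPUS = []
ADDED = add_changes_to_corpus(RECIPES, CORPUS)
```

A set is kept alongside the list only so that the membership test at step 60 does not require scanning the whole corpus for every recipe.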

It can be seen, therefore, that by providing a sub- 
stantial number of intermediate files, created using 
the system detailed in Figure 4, the system shown in 
Figure 5 will generate an aligned corpus which may be 
used in combination with the system shown in Figure 
2. Maximum benefit is gained from the system when 
source files, used to generate intermediate files and 
subsequently used to create the aligned corpus, re- 
late to similar subject matter as source files which are 
to be translated by the system. Thus, a family of ma- 
chines, such as photocopiers, laser printers, termi- 
nals, etc, could have their own specific aligned cor- 
pus, generated by using source files produced for 
earlier models. Thereafter, this corpus could be used
for translating the instruction manuals for new models,
greatly facilitating this procedure in terms of consistency,
reliability and speed of production.

The invention has been described with reference
to delimiters being provided by WP commands. Alter- 
natively, other delimiters may be used such as mark- 
ers provided in a document structuring language such 
as the Standard Generalised Markup Language or 
Office Document Architecture. Similarly, typesetting 
commands may be used as provided in languages 
such as TEX, LATEX or TROFF. 



Claims 

1. A system for aligning source texts of different 
natural languages to produce or add to an aligned 
corpus, the system including 

means for reading source text files, repre- 
senting similar information in different natural 
languages; and 

aligning means for determining an align- 
ment of text portions, from respective source 
files, characterised in that said source text files
have similar presentational attributes and in that 
said aligning means operates with reference to 
presentational formatting data within said text 
files. 

2. A system according to claim 1 wherein said formatting
data delimits the text portions to be
aligned.

3. A system according to claim 1 or 2 wherein said
formatting data is non-textual data.

4. A system according to claim 1, 2 or 3 wherein said
formatting data comprises formatting commands
generated by a word processing system.

5. A system according to claim 3 or 4, wherein said
non-textual data is converted into textual form
prior to performing said alignment.

6. A system according to any preceding claim,
wherein different word processing commands for
different word processing systems are converted
into identifiable generic forms prior to performing
said alignment.

7. A system according to any of claims 1 to 6, wherein
alignable text portions are identified by comparing
files to identify differences between the
said files, in which corresponding formatting data
forms are not different.

8. A system according to claim 7, wherein differences
between the two files are identified by a
differential file comparator.



9. A system according to any of claims 1 to 8, wherein
pairs of text portions, taken from respective
source text files and identified as being similarly
positioned, are written to an aligned corpus.

10. A method of aligning source texts of different
natural languages to produce or add to an aligned
corpus, the method comprising:

reading source text files, representing similar
information in different natural languages;
and

recording information aligning similar text
portions, from respective files, characterised in
that said source text files have similar presentational
attributes, and in that said aligning step is
performed with reference to presentational formatting
data present within said text files.

11. A method according to claim 10 wherein text portions
to be aligned are delimited by said formatting
data.

12. A method according to claim 10 or 11, wherein
said formatting data is non-textual data.

13. A method according to claim 10, 11 or 12 wherein
the formatting data comprises commands generated
by a word processing system.

14. A method according to claim 12 or 13 wherein
non-textual data is converted into textual form
prior to performing said aligning step.

15. A method according to claim 13, wherein corresponding
word processing formatting commands
generated for different word processing systems
are converted into identifiable forms prior to performing
said aligning step.

16. A method according to any of claims 10 to 15, 
wherein alignable text portions are identified by
comparing files to identify differences between 
the said files, in which corresponding formatting 
data forms are not different. 

17. A method according to claim 16, wherein differ- 
ences between the two files are identified by dif- 
ferential file comparison. 

18. A method according to any of claims 10 to 17, 
wherein pairs of text portions, taken from respective
source text files and identified as being similarly
positioned, are written to an aligned corpus.

19. A method of automatically translating a subject 
text from a first natural language to a second 
natural language, a method comprising: obtain- 
ing a machine readable recording of aligned text 
portions generated by a method according to any 
of claims 10 to 18, identifying correspondence 
between portions of the subject text in the first 
language and portions present in the recording of 
aligned portions, and outputting corresponding 
text portions in said second language, by refer- 
ence to the recorded alignments. 

20. A machine-readable recording of aligned text por- 
tions generated by a method as claimed in any of 
claims 10 to 18. 



EP 0 597 611 A2

FIG. 1 (drawing not reproduced)



FIG. 2 (flowchart, drawing not reproduced). Legible labels:
INPUT FILE; ALIGNED CORPUS; START; END OF PAGE?;
READ A SENTENCE OF INPUT FILE; IS SENTENCE IN THE CORPUS?;
INPUT TRANSLATION; WRITE TO THE OUTPUT FILE;
WRITE TO THE DESTINATION FILE; ADD TO CORPUS?;
ADD TO CORPUS; STOP; OUT FILE.



FIG. 3 (flowchart, drawing not reproduced). Legible labels:
GENERATE OR PROCURE CORRELATED COPIES IN DIFFERENT
LANGUAGES OF THE SAME DOCUMENT;
READ A FIRST SOURCE FILE USING THE PROCESS SHOWN IN
FIGURE 4 TO PRODUCE A FIRST INTERMEDIATE FILE;
READ A SECOND SOURCE FILE USING THE PROCESS SHOWN IN
FIGURE 4 TO PRODUCE A SECOND INTERMEDIATE FILE;
PRODUCE AN ALIGNED CORPUS BY COMPARING THE
INTERMEDIATE FILES SHOWN IN FIGURE 5;
USE CORPUS IN A TRANSLATION SYSTEM OF THE TYPE SHOWN IN
FIGURE 2.



FIG. 4 (flowchart, drawing not reproduced). Legible labels:
START; ANOTHER PAGE? (YES/NO); READ PAGE;
ANOTHER ENTRY IN CONVERSION TABLE? (YES/NO);
READ ENTRY FROM TABLE; ENTRY FOUND IN DOCUMENT?;
EXCHANGE ENTRY; END OF PAGE? (YES/NO); STOP.



FIG. 5 (flowchart, drawing not reproduced). Legible labels:
INTERMEDIATE FILE; INTERMEDIATE FILE;
TEXT IN LANGUAGE 1 := INPUT IN RECIPE;
TEXT IN LANGUAGE 2 := OUTPUT IN RECIPE;
ALIGNED PAIR := TEXT 1 * TEXT 2; ALIGNED PAIR ALREADY ...;
WRITE ALIGNED PAIR TO CORPUS; STOP; ALIGNED CORPUS.




Europäisches Patentamt
European Patent Office
Office européen des brevets

Publication number : 0 597 611 A3

EUROPEAN PATENT APPLICATION

Application number : 93308661.3
Date of filing : 29.10.93
Int. Cl. : G06F 15/38

Priority : 30.10.92 GB 9222768

Date of publication of application :
18.05.94 Bulletin 94/20

Designated Contracting States :
DE ES FR GB IT NL

Date of deferred publication of search report :
21.09.94 Bulletin 94/38

Applicant : CANON EUROPA N.V.
Bovenkerkerweg 59-61
NL-1185 XB Amstelveen (NL)

Applicant : CANON RESEARCH CENTRE EUROPE LIMITED
19/20 Frederick Sanger Road,
Surrey Research Park,
University of Surrey
Guildford, SY GU2 5YD (GB)

Inventor : O'Donoghue, Timothy Francis
Canon Research Centre
17-20 Frederick Sanger Rd.
Surrey Research Park
Guildford, Surrey, GU2 5YD (GB)

Inventor : Wachtel, Thomas Juliusz
Canon Res.Cntr.Eur.Ltd.
17-20 Frederick Sanger Rd.
Surrey Research Park
Guildford, Surrey GU2 5YD (GB)

Representative : Beresford, Keith Denis Lewis et al
BERESFORD & Co.
2-5 Warwick Court
High Holborn
London WC1R 5DJ (GB)

